{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing for...\n",
    "\n",
    "* Ballroom, extended ballroom, gtzan (genre)\n",
    "* emomusic45s (emotion)\n",
    "* jamendo (vocal or not)\n",
    "* gtzan music speech\n",
    "\n",
    "### like how?\n",
    "For each dataset,\n",
    "* Audio: decoded, resampled to 12 kHz, and stored as filename.npy\n",
    "* Annotations: becomes a csv file, `[dataset_name].csv` file with binary values\n",
    "  * [file_id]: file id -- for management\n",
    "  * [index]: file index (integer) -- for management\n",
    "  * [filepath]: filepath -- for reading file\n",
    "  * [label]: y in integer -- for stratified spliting\n",
    "* Annotations: `[dataset_name].npy`: `n_track, 1`\n",
    "* Annotations: `[dataset_name]_1hot.npy`: `(n_track, n_label)` if classification task\n",
    "* Setup: `[dataset_name]_setup.json` for setup like\n",
    "  * task name\n",
    "  * task type (classification or regression)\n",
    "  * suggested `k` for cross validation\n",
    "  * dataset root path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#!/usr/bin/env python\n",
    "# -*- coding: utf-8 -*-\n",
    "%matplotlib notebook\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import h5py\n",
    "import librosa\n",
    "import os, sys\n",
    "import time\n",
    "import pandas as pd \n",
    "from collections import namedtuple\n",
    "from sklearn.model_selection import StratifiedShuffleSplit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PATHS\n",
    "You should change these to where you saved/untarred the datasets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# For almost every datasets,\n",
    "PATH_DATASETS = '/misc/kcgscratch1/ChoGroup/keunwoo/datasets/'\n",
    "# To store processed Jamendo dataset\n",
    "PATH_PROCESSED = '/misc/kcgscratch1/ChoGroup/keunwoo/datasets_processed/'\n",
    "# For some random reason I put UrbanSound8K in this folder. \n",
    "PATH_URBAN = '/misc/kcgscratch1/ChoGroup/keunwoo/UrbanSound8K' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "FOLDER_CSV = 'data_csv/'\n",
    "SR = 12000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "allowed_exts = set(['mp3', 'wav', 'au'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# Create CSV files\n",
    "\n",
    "Helper functions for other datasets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "column_names = ['id', 'filepath', 'label'] # todo: label_for_stratify?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def get_rows_from_folders(folder_dataset, folders, dataroot=None):\n",
    "    '''gtzan, ballroom extended. each class in different folders'''\n",
    "    rows = []\n",
    "    if dataroot is None:\n",
    "        dataroot = PATH_DATASETS\n",
    "    for label_idx, folder in enumerate(folders): # assumes different labels per folders.\n",
    "        files = os.listdir(os.path.join(dataroot, folder_dataset, folder))\n",
    "        files = [f for f in files if f.split('.')[-1].lower() in allowed_exts]\n",
    "        for fname in files:\n",
    "            file_path = os.path.join(folder_dataset, folder, fname)\n",
    "            file_id =fname.split('.')[0]\n",
    "            file_label = label_idx\n",
    "            rows.append([file_id, file_path, file_label])\n",
    "    print('Done - length:{}'.format(len(rows)))\n",
    "    print(rows[0])\n",
    "    print(rows[-1])\n",
    "    return rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def write_to_csv(rows, column_names, csv_fname):\n",
    "    '''rows: list of rows (= which are lists.)\n",
    "    column_names: names for columns\n",
    "    csv_fname: string, csv file name'''\n",
    "    df = pd.DataFrame(rows, columns=column_names)\n",
    "    df.to_csv(os.path.join(FOLDER_CSV, csv_fname))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Jamendo voice activity\n",
    "pre-preprocess: trim them and save into `PATH_JAMENDO_TRIM`. \n",
    "\n",
    "The `train/test/valid` structure is preserved, segments are stored in `sing/nosing` subdirectory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "PATH_JAMENDO = os.path.join(PATH_DATASETS, 'jamendo_voice_activity')\n",
    "PATH_JAMENDO_TRIM = os.path.join(PATH_PROCESSED, 'jamendo_trimmed/')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "folders = ['train', 'test', 'valid']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Jamendo - x\n",
    "\n",
    "This code trims the files into 'sing' and 'nosing' segments and save them as wav files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Start..\n",
      "train..\n",
      "  4402 - Colombia.ogg..\n",
      "  4501 - The Final Rewind.ogg..\n",
      "  4601 - Sunlight.ogg..\n",
      "  4701 - Seven Months.ogg..\n",
      "  4801 - Perdre le Nord.ogg..\n",
      "  4901 - Ok.ogg..\n",
      "  5001 - Its Easy.ogg..\n",
      "  5101 - Angels Of Crime.ogg..\n",
      "  5201 - Visa pour hier.mp3..\n",
      "  5301 - Sunken Sailor.mp3..\n",
      "  5402 - The Louise XIV Cathorse.mp3..\n",
      "  5501 - alice.mp3..\n",
      "  5601 - A smile on your face.mp3..\n",
      "  5701 - A new singing song.mp3..\n",
      "  5801 - 10min.mp3..\n",
      "  5901 - A city.mp3..\n",
      "  6002 - emporte-moi.mp3..\n",
      "test..\n",
      "  003 - castaway.ogg..\n",
      "  103 - Une charogne.ogg..\n",
      "  205 - 05 LIrlandaise.ogg..\n",
      "  305 - 16 ans.ogg..\n",
      "  405 - 2003-Circonstances attenuantes part II.ogg..\n",
      "  505 - A Poings Fermes.ogg..\n",
      "  605 - Crepuscule.ogg..\n",
      "  705 - Dance.ogg..\n",
      "  804 - Healing Luna.mp3..\n",
      "  903 - Say me Good Bye.mp3..\n",
      "  1004 - Inside.mp3..\n",
      "  1105 - Elles disent.mp3..\n",
      "  1203 - Si Dieu.mp3..\n",
      "  1303 - School.mp3..\n",
      "  1404 - Believe.mp3..\n",
      "  1504 - You are.mp3..\n",
      "valid..\n",
      "  005 - Change.ogg..\n",
      "  105 - Cecilia.ogg..\n",
      "  205 - Callypige palindrome.ogg..\n",
      "  305 - Budenzauber.ogg..\n",
      "  405 - Bucle Paranoideal.ogg..\n",
      "  505 - Bleu horizon.ogg..\n",
      "  605 - Black Box.ogg..\n",
      "  705 - Bazooka enraye.ogg..\n",
      "  805 - Bandicape.ogg..\n",
      "  905 - Aux Cieux, Tu es mon Dieu.ogg..\n",
      "  1005 - Au Mage.ogg..\n",
      "  1105 - As Longas Passagens.ogg..\n",
      "  1205 - Argeles.ogg..\n",
      "  1305 - Anyway.ogg..\n",
      "  1405 - Animal.ogg..\n",
      "  1505 - Angeliques.ogg..\n"
     ]
    }
   ],
   "source": [
    "# Create 'x's \n",
    "print('Start..')\n",
    "for folder in folders:\n",
    "    print('{}..'.format(folder))\n",
    "    path_folder = os.path.join(PATH_JAMENDO, folder)\n",
    "    files = os.listdir(path_folder)\n",
    "    music_files = [f for f in files if f.split('.')[-1].lower() in ('ogg', 'mp3') and not f.startswith('._')]\n",
    "    \n",
    "    lab_files = [f.replace('ogg', 'lab').replace('mp3', 'lab') for f in music_files]\n",
    "    try:\n",
    "        os.mkdir(os.path.join(PATH_JAMENDO_TRIM, folder, 'sing/'))\n",
    "        os.mkdir(os.path.join(PATH_JAMENDO_TRIM, folder, 'nosing/'))\n",
    "    except:\n",
    "        pass\n",
    "    for file_idx, (m_file, l_file) in enumerate(zip(music_files, lab_files)): # music file, lab file (text)\n",
    "        if folder == 'train' and file_idx <= 43:\n",
    "            continue\n",
    "        print('  {} {}..'.format(file_idx, m_file))\n",
    "        filename = l_file.rstrip('.lab') # == song title.\n",
    "        path_audio = os.path.join(path_folder, m_file)\n",
    "        src, sr = librosa.load(path_audio, sr=SR)\n",
    "        starts, ends, labels = [], [], []\n",
    "        with open(os.path.join(PATH_JAMENDO, 'jamendo_lab/', l_file)) as f_lab:\n",
    "            for line in f_lab:\n",
    "                start, end, label = line.rstrip('\\n').split(' ') # [second], [s], 'sing' or 'nosing'\n",
    "                starts.append(start)\n",
    "                ends.append(end)\n",
    "                labels.append(label)\n",
    "        for seg_idx, (start, end, label) in enumerate(zip(starts, ends, labels)):\n",
    "            out_name = '{}_{}.wav'.format(filename, seg_idx)\n",
    "            start_sp = int(float(start) * SR)\n",
    "            end_sp = int(float(end) * SR)\n",
    "            librosa.output.write_wav(os.path.join(PATH_JAMENDO_TRIM, folder, label, out_name),\n",
    "                                    src[start_sp:end_sp],\n",
    "                                    SR)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### jamendo - y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4086\n",
      "4086\n"
     ]
    }
   ],
   "source": [
    "# y: csv file, columns=['filename_total', 'filename_seg', 'y', 'category']\n",
    "#                       'blah.ogg', 'blah_1.ogg', 1, 'train'\n",
    "filename_segs = []\n",
    "filepaths = []\n",
    "ys = []\n",
    "categories = []\n",
    "\n",
    "for folder in folders:\n",
    "#     print('{}..'.format(folder))\n",
    "    path_folder = os.path.join(PATH_JAMENDO, folder)\n",
    "    files = os.listdir(path_folder)\n",
    "    music_files = [f for f in files if f.split('.')[-1].lower() in ('ogg', 'mp3') and not f.startswith('._')]\n",
    "    \n",
    "    lab_files = [f.replace('ogg', 'lab').replace('mp3', 'lab') for f in music_files]\n",
    "    try:\n",
    "        os.mkdir(os.path.join(PATH_JAMENDO_TRIM, folder, 'sing/'))\n",
    "        os.mkdir(os.path.join(PATH_JAMENDO_TRIM, folder, 'nosing/'))\n",
    "    except:\n",
    "        pass\n",
    "    for file_idx, (m_file, l_file) in enumerate(zip(music_files, lab_files)): # music file, lab file (text)\n",
    "#         print('  {} {}..'.format(file_idx, m_file))\n",
    "        filename = l_file.rstrip('.lab') # == song title.\n",
    "        is_sings = []\n",
    "        labels = []\n",
    "        with open(os.path.join(PATH_JAMENDO, 'jamendo_lab/', l_file)) as f_lab:\n",
    "            for line in f_lab:\n",
    "                start, end, label = line.rstrip('\\n').split(' ') # [second], [s], 'sing' or 'nosing'\n",
    "                is_sings.append(int(label == 'sing'))\n",
    "                labels.append(label)\n",
    "        for seg_idx, (is_sing, label) in enumerate(zip(is_sings, labels)):\n",
    "            out_name = '{}_{}.wav'.format(filename, seg_idx)\n",
    "            filename_segs.append(out_name.rstrip('.wav'))\n",
    "            filepaths.append(os.path.join(folder, label, out_name))\n",
    "            ys.append(is_sing)\n",
    "            categories.append(folder)\n",
    "\n",
    "print len(filepaths)\n",
    "print len(ys)\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "jamendo_vd.csv is saved! \n"
     ]
    }
   ],
   "source": [
    "write_to_csv(zip(*[filename_segs, filepaths, ys, categories]), ['id', 'filepath', 'label', 'category'], 'jamendo_vd.csv')\n",
    "print('jamendo_vd.csv is saved! ')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ballroom extended"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "folder_dataset_be = 'ballroom_extended_2016/'\n",
    "labels_be = ['Chacha', 'Foxtrot', 'Jive', 'Pasodoble', 'Rumba', 'Salsa', 'Samba', 'Slowwaltz', 'Tango', 'Viennesewaltz'\n",
    "         , 'Waltz', 'Wcswing', 'Quickstep']\n",
    "n_label_be = len(labels_be)\n",
    "folders_be = [s + '/' for s in labels_be]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Done - length:4180\n",
      "['100701', 'ballroom_extended_2016/Chacha/100701.mp3', 0]\n",
      "['118720', 'ballroom_extended_2016/Quickstep/118720.mp3', 12]\n"
     ]
    }
   ],
   "source": [
    "rows_ballroom = get_rows_from_folders(folder_dataset_be, folders_be)\n",
    "write_to_csv(rows_ballroom, column_names, 'ballroom_extended.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### gtzan genre"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "folder_dataset_gtg = 'gtzan_genre/genres/'\n",
    "labels_gtg = ['blues', 'classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', 'pop', 'reggae', 'rock']\n",
    "n_label_gtg = len(labels_gtg)\n",
    "folders_gtg = [s + '/' for s in labels_gtg]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Done - length:1000\n",
      "['blues', 'gtzan_genre/genres/blues/blues.00000.au', 0]\n",
      "['rock', 'gtzan_genre/genres/rock/rock.00099.au', 9]\n"
     ]
    }
   ],
   "source": [
    "rows_gtg = get_rows_from_folders(folder_dataset_gtg, folders_gtg)\n",
    "write_to_csv(rows_gtg, column_names, 'gtzan_genre.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### gtzan music speech"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "folder_dataset_gtms = 'gtzan_music_speech/music_speech/'\n",
    "labels_gtms = ['music', 'speech']\n",
    "n_label_gtms = len(labels_gtms)\n",
    "folders_gtms = ['music_wav', 'speech_wav']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Done - length:128\n",
      "['bagpipe', 'gtzan_music_speech/music_speech/music_wav/bagpipe.wav', 0]\n",
      "['voices', 'gtzan_music_speech/music_speech/speech_wav/voices.wav', 1]\n"
     ]
    }
   ],
   "source": [
    "rows_gtms = get_rows_from_folders(folder_dataset_gtms, folders_gtms)\n",
    "write_to_csv(rows_gtms, column_names, 'gtzan_speechmusic.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### emoMusic45s\n",
    "* at /misc/kcgscratch1/ChoGroup/keunwoo/datasets/emoMusic45s,\n",
    "* files: 0.mp3 - 1000.mp3\n",
    "* labels: `static_annotations.csv`, \n",
    "```csv\n",
    "song_id,mean_arousal,std_arousal,mean_valence,std_valence\n",
    "2,3.1,0.99443,3,0.66667\n",
    "```\n",
    "\n",
    "Some files are missing AV labels, overall 744 songs. \n",
    "\n",
    "I ignore train/test pre-split cuz it seems unnecessary.\n",
    "\n",
    "http://cvml.unige.ch/databases/emoMusic/\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### emoMusic 45s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "emoMusic_folder = 'emoMusic45s'\n",
    "anno_file = 'static_annotations.csv'\n",
    "info_file = 'song_info.csv'\n",
    "df = pd.read_csv(os.path.join(PATH_DATASETS, emoMusic_folder, anno_file))\n",
    "# df_info = pd.read_csv(os.path.join(PATH_DATASETS, emoMusic_folder, info_file))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "filenames = ['%s.mp3' % idx for idx in df['song_id']]\n",
    "filepaths = [os.path.join(emoMusic_folder, 'clips_45seconds', f) for f in filenames]\n",
    "# categories = [df['']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "write_to_csv(zip(*[df['song_id'], filepaths, df['mean_arousal']/9., df['mean_valence']/9.]),\n",
    "            ['id', 'filepath', 'label_arousal', 'label_valence'],\n",
    "            'emoMusic.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "744"
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df['song_id'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### urbansound8k\n",
    "\n",
    "csv: `/misc/kcgscratch1/ChoGroup/keunwoo/UrbanSound8K/metadata/UrbanSound8K.csv`\n",
    "```\n",
    "slice_file_name,fsID,start,end,salience,fold,classID,class\n",
    "100032-3-0-0.wav,100032,0.0,0.317551,1,5,3,dog_bark\n",
    "100263-2-0-117.wav,100263,58.5,62.5,1,5,2,children_playing\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "df = pd.read_csv(os.path.join(PATH_URBAN, 'metadata/UrbanSound8K.csv'))\n",
    "\n",
    "ids = [s.rstrip('.wav') for s in df['slice_file_name']]\n",
    "filepaths = [os.path.join('fold%s' % fd, fn) for fd, fn in zip(df['fold'], df['slice_file_name'])]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "write_to_csv(zip(*[ids, filepaths, df['classID'], df['fold']]), ['id', 'filepath', 'label', 'fold'],\n",
    "            'urbansound.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index([u'slice_file_name', u'fsID', u'start', u'end', u'salience', u'fold',\n",
      "       u'classID', u'class'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "print df.columns"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
